Let’s first see what variables we have.
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
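The two listings above have the shape of R's `names()` and `summary()` output. A minimal sketch of such calls follows, using a small made-up data frame in place of the wine data. The values and the commented-out file name are assumptions for illustration and are not taken from this report.

```r
# Hypothetical loading step; the file name is an assumption.
# wine <- read.csv("wineQualityReds.csv")

# Toy stand-in for the wine data; every value below is made up.
wine <- data.frame(
  X = 1:6,
  alcohol = c(9.4, 9.8, 10.5, 11.2, 12.0, 12.8),
  volatile.acidity = c(0.70, 0.88, 0.60, 0.45, 0.36, 0.30),
  quality = c(5L, 5L, 6L, 6L, 7L, 8L)
)

# Column names, in the same form as the first listing.
var_names <- names(wine)
print(var_names)

# Min, quartiles, median, mean and max for every column,
# in the same form as the second listing.
print(summary(wine))
```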
Now let’s plot some histograms and look at the distributions of some of these variables.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
Most wines have a residual sugar content (sweetness) of less than 3 g/dm^3, with a median of 2.2 and a mean of 2.54.
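As a rough sketch of how a histogram and its summary relate, the base R calls below bin a handful of made-up residual sugar values. The numbers are invented to mimic a right-skewed variable and are not the wine data; the report's own plotting code is not shown, so base `hist()` is an assumption.

```r
# Made-up residual sugar values (g/dm^3), for illustration only.
residual_sugar <- c(1.6, 1.9, 2.0, 2.2, 2.2, 2.4, 2.6, 3.1, 5.5, 13.9)

# Bin into 1 g/dm^3 wide intervals; plot = FALSE returns the counts
# without drawing. Call plot(h) to draw the histogram.
h <- hist(residual_sugar, breaks = seq(0, 16, by = 1), plot = FALSE)
print(h$counts)

# A long right tail pulls the mean above the median.
sugar_mean <- mean(residual_sugar)
sugar_median <- median(residual_sugar)
print(summary(residual_sugar))
```

A mean that sits noticeably above the median, as in the residual sugar summary above, is a quick sign of a right-skewed distribution.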
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
Most of the wines have between 9% and 13% alcohol, with a mean of 10.42% and a median of 10.20%.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
The most common quality ratings are 5 and 6 (out of 10), with a median of 6 and a mean of 5.6. Strangely, there are no wines with a quality below 3 or above 8. Let’s confirm this by subsetting the data for any wines with a quality outside that range.
## [1] X fixed.acidity volatile.acidity
## [4] citric.acid residual.sugar chlorides
## [7] free.sulfur.dioxide total.sulfur.dioxide density
## [10] pH sulphates alcohol
## [13] quality
## <0 rows> (or 0-length row.names)
There were no rows in the subset, which confirms that there are no wines with a quality of less than 3 or more than 8.
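A check of this kind can be sketched with `subset()`. The data frame here is a made-up stand-in with only two columns, so the printed header is shorter than the one above, but an empty result prints in the same `<0 rows>` form.

```r
# Toy stand-in for the wine data; values are made up.
wine <- data.frame(X = 1:5, quality = c(3L, 5L, 6L, 6L, 8L))

# Keep only rows whose quality falls outside the observed 3 to 8 range.
out_of_range <- subset(wine, quality < 3 | quality > 8)

# An empty data frame prints its column names followed by "<0 rows>".
print(out_of_range)
print(nrow(out_of_range))
```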
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
Most wines have between 6 and 11 g/dm^3 of fixed acidity (tartaric acid), with a median of 7.9 and a mean of 8.32.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
The volatile acidity (acetic acid) distribution looks like it could be bimodal, with peaks around 0.4 and 0.6. The median volatile acidity is 0.52 and the mean is 0.53.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
Most wines have citric acid of less than 0.6 g/dm^3. The most common citric acid amount is 0.0 showing that a large number of wines do not contain any. The median is 0.26 and the mean is 0.27.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
Most wines have less than 35 mg/dm^3 free sulfur dioxide content. The median is 14.00 and the mean is 15.87.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
Most wines have less than 100 mg/dm^3 of total sulfur dioxide. The median is 38 and the mean is 46.47.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
The pH of the red wines appears to be distributed normally. The median is 3.31 and the mean is also 3.31.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
The density of the wines also appears to be distributed normally. The median is 0.9968 and the mean is 0.9967.
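Judging normality from a histogram alone is subjective. One possible cross-check, sketched here on synthetic values rather than the wine data, is to compare mean and median and run a Shapiro-Wilk test. The spread of 0.15 and both samples are arbitrary choices made for illustration.

```r
# Synthetic values only: evenly spaced normal quantiles centred on 3.31.
# The spread of 0.15 is an arbitrary choice, not a value from the report.
normal_like <- 3.31 + 0.15 * qnorm(ppoints(50))

# A right-skewed comparison sample (log-normal quantiles).
skewed <- exp(qnorm(ppoints(50)))

# For a symmetric sample the mean and median coincide.
print(c(mean = mean(normal_like), median = median(normal_like)))

# Shapiro-Wilk test: a small p-value is evidence against normality.
sw_normal <- shapiro.test(normal_like)
sw_skewed <- shapiro.test(skewed)
print(sw_normal$p.value)
print(sw_skewed$p.value)

# Visual alternative (draws a plot):
# qqnorm(normal_like); qqline(normal_like)
```

With around 1,600 rows a formal test will flag even small departures from normality, so a Q-Q plot is often the more informative check for data of this size.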
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
Most wines have sulphates between 0.4 and 0.8. The median is 0.62 and the mean is 0.66.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
The majority of wines contain between 0.05 and 0.10 g/dm^3 of salt. A small percentage of wines include a much larger proportion of salt (>0.40 g/dm^3). The mean chlorides content is 0.087 and the median is 0.079.
The dataset contains 1,599 records and 12 features (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, quality).
The main feature of interest in this dataset is the reported quality, specifically how quality values increase or decrease with respect to changes in other variables (i.e. what makes a high quality wine?).
The matrix visualisations revealed weak to moderate correlations between quality and four other variables: alcohol, citric acid, sulphates, and volatile acidity. To further explore this finding, additional variables were created from the descriptive statistics (mean, median) of these original variables.
Though the above plots offer a view into the distribution of each variable, it’s difficult to know how interesting each of these variables, viewed independently, may be for further investigation. Let’s create some scatterplot matrix visualizations to see if there are any clues as to which of these variables correlate strongest with the main variable of interest, quality.
The visualization above considers the correlation between the quantitative variables in the dataset. A positive correlation is indicated by blue colors and lines going up and to the right, while negative correlation is indicated by red colors and lines going down and to the right.
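The numbers behind a correlation matrix plot can be sketched with base R's `cor()`, which computes Pearson correlations by default. The data frame below is a made-up stand-in, so the coefficients it prints are not the ones from this report.

```r
# Toy stand-in for the wine data; values are made up.
toy <- data.frame(
  alcohol = c(9.4, 9.8, 10.5, 11.2, 12.0, 12.8),
  volatile.acidity = c(0.70, 0.88, 0.60, 0.45, 0.36, 0.30),
  quality = c(5, 5, 6, 6, 7, 8)
)

# Pairwise Pearson correlation matrix (symmetric, ones on the diagonal).
r <- cor(toy)
print(round(r, 2))

# Correlation of each input variable with quality,
# ranked by absolute strength.
with_quality <- r[setdiff(rownames(r), "quality"), "quality"]
ranked <- with_quality[order(-abs(with_quality))]
print(round(ranked, 2))
```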
As a result of the above visualizations, the following correlations are exposed:
Let’s do some bivariate plotting and analysis to further explore the distributions of some of these variables in greater detail.
It is easy to see the moderate correlation (0.67) between fixed acidity and citric acid.
The moderate correlation (0.66) between fixed acidity and density is also clear.
There is an interesting pattern here: a moderate correlation (0.66), with variance increasing as sulfur dioxide increases.
There is a moderate negative correlation (-0.54) between citric acid and pH.
Here we see the pattern and negative correlation (-0.55) between citric acid and volatile acidity. Volatile acidity decreases as citric acid increases. Since volatile acidity is a marker of spoilage [2], perhaps here we are seeing how citric acid prevents spoilage?
There is a clear negative correlation (-0.68) between fixed acidity and pH.
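Pairs like the ones above can be pulled out of a correlation matrix programmatically. The sketch below keeps every pair whose absolute correlation exceeds 0.5. The cutoff and the toy values are assumptions chosen for illustration, not the report's data.

```r
# Toy stand-in for three of the wine variables; values are made up.
toy <- data.frame(
  fixed.acidity = c(7.4, 7.8, 11.2, 7.9, 6.7, 10.1, 8.9, 5.6),
  citric.acid   = c(0.00, 0.04, 0.56, 0.06, 0.08, 0.45, 0.30, 0.02),
  pH            = c(3.51, 3.26, 3.16, 3.30, 3.55, 3.10, 3.20, 3.60)
)

r <- cor(toy)
threshold <- 0.5  # illustrative cutoff for a "moderate" correlation

# Look only above the diagonal so each pair is reported once.
idx <- which(upper.tri(r) & abs(r) > threshold, arr.ind = TRUE)
strong <- data.frame(
  var1 = rownames(r)[idx[, "row"]],
  var2 = colnames(r)[idx[, "col"]],
  r    = r[idx]
)

# Strongest relationships first, regardless of sign.
strong <- strong[order(-abs(strong$r)), ]
print(strong)
```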
We see here again that higher quality wines seem to have a higher alcohol content. Let’s take a look at a boxplot.
It is true that the median and mean alcohol of the highest quality wines (8) are greater than those of the lower quality wines. However, the lowest quality wines do not have the lowest mean and median alcohol, which hints that other factors are likely involved. Also, is alcohol itself the driving factor behind what we see in these boxplots? Or is it some other factor related to alcohol, such as acidity? Further investigation is needed to identify what other factors are influencing quality.
Citric acid seems to have an impact on quality, with higher quality wines having a higher mean and median citric acid content. But what’s the real driver: is it the citric acid itself? Or is it the overall acidity of the wine?
Wines with a greater amount of sulphates tend to have better quality.
Here we see an inverse relationship with pH. That is, the higher pH wines tend to have a lower quality.
The greater the volatile acidity, the lower the wine quality is according to this boxplot. The lowest quality wines (3) in the dataset have more than twice the median volatile acidity of the highest quality wines (8).
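The per-quality medians and means that a boxplot displays can also be computed directly, which makes a comparison such as "more than twice the median" easy to verify. This is a sketch on made-up values, not the wine data.

```r
# Toy stand-in; values are made up for illustration.
toy <- data.frame(
  quality = c(3, 3, 3, 5, 5, 5, 8, 8, 8),
  volatile.acidity = c(0.80, 0.90, 1.00, 0.50, 0.60, 0.70, 0.30, 0.40, 0.50)
)

# Median and mean volatile acidity within each quality level.
med_by_quality <- tapply(toy$volatile.acidity, toy$quality, median)
mean_by_quality <- tapply(toy$volatile.acidity, toy$quality, mean)
print(med_by_quality)
print(mean_by_quality)

# Ratio of the median at the lowest quality to that at the highest.
ratio <- med_by_quality[["3"]] / med_by_quality[["8"]]
print(ratio)

# The matching plot (draws a figure):
# boxplot(volatile.acidity ~ quality, data = toy)
```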
Puzzlingly, the median and mean total sulfur dioxide peak at the wines with quality 5…
High quality wines are slightly less dense than lower quality wines.
Though it’s still too early to determine causation, let’s list out some factors which seem to directly or indirectly influence wine quality.
Volatile acidity seems to lower wine quality. As noted earlier, volatile acidity is an indicator of wine spoilage, which would explain why it might lower quality.
Density might lower wine quality. Denser wines have lower quality values.
Citric Acid seems to positively influence a wine’s quality, with higher levels correlating to higher quality values. We also observed an inverse relationship between citric acid and volatile acidity – and supposed that citric acid might play a role in inhibiting wine spoilage. Perhaps it is this preservative function of citric acid that improves wine quality.
Sulphates seem to improve wine quality in higher amounts. Perhaps sulphates inhibit wine spoilage?
Alcohol seems to positively influence wine quality, with higher quality wines tending to have higher alcohol. Possible causes could be preservative effects of alcohol (inhibition of bacterial growth), or simply individual preference for more alcohol.
pH may lower wine quality, as we observed that wines with higher pH levels (less acidic) tended to have lower quality values.
Out of the above factors, the most interesting relationship is that between citric acid and volatile acidity, as it’s the only relationship with a correlation magnitude above 0.5 that can be directly linked back to quality (citric acid having a 0.22 correlation with quality and volatile acidity having a -0.39 correlation with quality).
It seems most of the higher quality wines (quality >= 6) are concentrated in the lower center of the plot.
In this plot most of the higher quality wines (quality >= 6) are in the upper right of the sample. A higher citric acid amount paired with a higher alcohol percentage might make for a better quality wine.
In this plot we can see that most of the higher quality wines (quality >= 6) are in the upper left portion of the group.
In this plot, wines seem to rise in quality as the amount of sulphates exceeds 0.6.
There were some interesting observations made in the above multivariate plots.
In the first plot, we see that there’s a noticeable concentration of wines with quality greater than or equal to 6 near the bottom of the plot around the middle of the x-axis. This region corresponds to wines having low volatile acidity (spoilage) and moderate citric acid.
In the next plot (alcohol, citric acid) the data is more spread out, but wines with quality greater than or equal to 6 look to be more frequent near the upper right corner of the sample. This corresponds to wines with a higher alcohol content, and moderate citric acid content.
Next we plotted alcohol and density. Again, wines with quality greater than or equal to 6 tended to have higher alcohol contents, and here we can see that they also tended to be less dense – with most higher quality wines having a density of less than 1.0.
Finally we plotted sulphates, citric acid, and quality. The pattern here is more subtle but it appears that after exceeding a sulphates level of 0.5 we start to see higher quality wines.
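Visual impressions such as "higher quality wines cluster at high alcohol and low density" can be backed by a simple count. The sketch below flags wines with quality >= 6 and compares their share inside and outside that region. The toy values and the use of medians as region boundaries are assumptions made for illustration.

```r
# Toy stand-in for the wine data; values are made up.
toy <- data.frame(
  alcohol = c(9.2, 9.5, 9.8, 10.0, 11.5, 12.0, 12.4, 13.0),
  density = c(0.9985, 0.9990, 0.9978, 0.9982, 0.9950, 0.9946, 0.9958, 0.9940),
  quality = c(5, 5, 6, 5, 6, 7, 7, 8)
)

# Same cutoff as in the plots: quality of 6 or more counts as "higher".
toy$high_quality <- toy$quality >= 6

# Region of interest: above-median alcohol and below-median density.
region <- toy$alcohol > median(toy$alcohol) &
  toy$density < median(toy$density)

# Share of higher quality wines inside and outside the region.
share_in <- mean(toy$high_quality[region])
share_out <- mean(toy$high_quality[!region])
print(c(inside = share_in, outside = share_out))
```

A large gap between the two shares would support the visual impression, while similar shares would suggest that the apparent cluster is an artifact of overplotting.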
The majority of wines with a quality of >= 6 appear to be concentrated toward the lower portion of the sample, toward the middle of the x-axis. This suggests that low volatile acidity and moderate citric acid are factors of quality wine.
In this plot wines with a quality >= 6 appear to be concentrated in the top right portion of the group, suggesting that a moderate to high alcohol content improves the quality of wine.
Wines with a quality of >= 6 are located mostly in the upper left portion of the sample. This data suggests that wines with a lower density are preferred.
We plotted and explored the input variables in this dataset to discover how they contributed to the output variable of quality. The main finding was that quality is influenced by different variables to varying degrees. Citric acid correlated positively with quality, possibly having an inhibitory effect on spoilage or conferring desirable flavor qualities. Alcohol also correlated positively with quality, with moderate to high percentage wines being rated the highest. Sulphates showed a weak correlation with quality, possibly by helping to prevent spoilage. Density was also a factor, with people mostly preferring less dense wines. The main difficulty in this exploration was identifying exactly how each variable contributed to quality, their combined contribution being difficult to deconstruct. One open question is why a higher alcohol content might be correlated with quality.
[0] https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityInfo.txt
[1] http://waterhouse.ucdavis.edu/whats-in-wine/fixed-acidity
[2] http://waterhouse.ucdavis.edu/whats-in-wine/volatile-acidity
[3] http://www.calwineries.com/learn/wine-chemistry/wine-acids/citric-acid
[4] https://www.winecurmudgeon.com/residual-sugar-in-wine-with-charts-and-graphs/
[5] http://www.scielo.br/scielo.php?script=sci_arttext&pid=S0101-20612015000100095
[6] http://www.morethanorganic.com/sulphur-in-the-bottle
[7] https://www.etslabs.com/analyses/DEN
[8] https://www.dhs.wisconsin.gov/chemical/sulfates.htm